random reshuffling
Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned Problems
However, known lower bounds ignore the problem's geometry, including its condition number, whereas the upper bounds explicitly depend on it. Perhaps surprisingly, we prove that when the condition number is taken into account, without-replacement SGD does not significantly improve on with-replacement SGD in terms of worst-case bounds, unless the number of epochs (passes over the data) is larger than the condition number.
Don't Compress Gradients in Random Reshuffling: Compress Gradient Differences
Gradient compression is a popular technique for improving the communication complexity of stochastic first-order methods in distributed training of machine learning models. However, existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well known in practice and was recently confirmed in theory that stochastic methods based on without-replacement sampling, e.g., the Random Reshuffling (RR) method, perform better than ones that sample the gradients with replacement. In this work, we close this gap in the literature and provide the first analysis of methods with gradient compression and without-replacement sampling. We first develop a distributed variant of random reshuffling with gradient compression (Q-RR), and show how to reduce the variance coming from gradient quantization through the use of control iterates. Next, to better fit Federated Learning applications, we incorporate local computation and propose a variant of Q-RR called Q-NASTYA, which uses local gradient steps and different local and global stepsizes. We then show how to reduce compression variance in this setting as well. Finally, we prove convergence results for the proposed methods and outline several settings in which they improve upon existing algorithms.
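The idea named in the title, compressing gradient differences rather than the gradients themselves, can be illustrated with a minimal single-worker sketch. It is not Q-RR or Q-NASTYA as defined in the paper: the Rand-k sparsifier, the control vector `h`, the rate `alpha`, and both function names are assumptions made here for illustration only.

```python
import numpy as np


def rand_k(v, k, rng):
    """Rand-k sparsifier (an assumed compressor): keep k random
    coordinates and rescale them by d/k so the output is unbiased."""
    v = np.asarray(v, dtype=float)
    d = v.size
    out = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)  # k distinct coordinates
    out[idx] = v[idx] * (d / k)
    return out


def compressed_difference_step(grad, h, k, alpha, rng):
    """One hypothetical step: compress grad - h instead of grad.

    Returns the gradient estimate and the updated control vector.
    """
    msg = rand_k(grad - h, k, rng)  # the only quantity that would be communicated
    estimate = h + msg              # receiver adds back the control vector
    h_new = h + alpha * msg         # control vector drifts toward grad
    return estimate, h_new
```

The point of the design is that the compression error scales with `grad - h` rather than with `grad`: once the control vector tracks the gradient well, the compressed message is close to zero and the estimate `h + msg` is close to exact, whereas compressing `grad` directly keeps a fixed amount of error.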
Random Reshuffling: Simple Analysis with Vast Improvements
Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling. Often contrasted with its sibling Stochastic Gradient Descent (SGD), RR is usually faster in practice and enjoys significant popularity in convex and non-convex optimization. The convergence rate of RR has attracted substantial attention recently and, for strongly convex and smooth functions, it was shown to converge faster than SGD if 1) the stepsize is small, 2) the gradients are bounded, and 3) the number of epochs is large. We remove these 3 assumptions, improve the dependence on the condition number from $\kappa^2$ to $\kappa$ (resp.\
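The difference between RR and SGD described in this abstract comes down to how the indices for each epoch are drawn. The sketch below shows it on a small noise-free least-squares problem. The problem, the stepsize rule, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np


def epoch_order(n, rng, reshuffle):
    """Indices visited during one epoch of n steps."""
    if reshuffle:
        return rng.permutation(n)      # RR: each index exactly once
    return rng.integers(0, n, size=n)  # SGD: sampled with replacement


def run(A, b, x0, stepsize, epochs, rng, reshuffle):
    """Incremental gradient steps on f(x) = (1/n) * sum_i 0.5 * (a_i @ x - b_i)**2."""
    x = x0.copy()
    n = A.shape[0]
    for _ in range(epochs):
        for i in epoch_order(n, rng, reshuffle):
            grad_i = A[i] * (A[i] @ x - b[i])  # gradient of the i-th term
            x = x - stepsize * grad_i
    return x


def demo(seed=0, epochs=20):
    """Return (initial, RR, SGD) distances to the known minimizer."""
    rng = np.random.default_rng(seed)
    n, d = 50, 5
    A = rng.standard_normal((n, d))
    x_star = rng.standard_normal(d)
    b = A @ x_star                     # consistent system, so x_star minimizes f
    # Assumed stepsize: 1 / (largest squared row norm), which keeps
    # every single step non-expansive on this problem.
    stepsize = 1.0 / np.max(np.sum(A**2, axis=1))
    x0 = np.zeros(d)
    x_rr = run(A, b, x0, stepsize, epochs, rng, reshuffle=True)
    x_sgd = run(A, b, x0, stepsize, epochs, rng, reshuffle=False)
    dist = lambda x: float(np.linalg.norm(x - x_star))
    return dist(x0), dist(x_rr), dist(x_sgd)
```

Calling `demo()` returns the three distances for one seed. A single run on a toy problem only illustrates the two sampling schemes and says nothing about the convergence rates discussed above.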
Random Reshuffling is Not Always Better
Many learning algorithms, such as stochastic gradient descent, are affected by the order in which training examples are used. It is often observed that sampling the training examples without replacement, also known as random reshuffling, causes learning algorithms to converge faster. We give a counterexample to the Operator Inequality of Noncommutative Arithmetic and Geometric Means, a longstanding conjecture that relates to the performance of random reshuffling in learning algorithms (Recht and Ré, Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences, COLT 2012). We use this to give an example of a learning task and algorithm for which with-replacement random sampling actually outperforms random reshuffling.